

Sampling-Based Optimization with Parallelized Physics Simulator for Bimanual Manipulation

Hurova, Iryna, Dan, Alinjar, Kruusamäe, Karl, Singh, Arun Kumar

arXiv.org Artificial Intelligence

In recent years, dual-arm manipulation has become an area of strong interest in robotics, with end-to-end learning emerging as the predominant strategy for solving bimanual tasks. A critical limitation of such learning-based approaches, however, is their difficulty in generalizing to novel scenarios, especially within cluttered environments. This paper presents an alternative paradigm: a sampling-based optimization framework that utilizes a GPU-accelerated physics simulator as its world model. We demonstrate that this approach can solve complex bimanual manipulation tasks in the presence of static obstacles. Our contribution is a customized Model Predictive Path Integral Control (MPPI) algorithm, guided by carefully designed task-specific cost functions, that uses GPU-accelerated MuJoCo for efficiently evaluating robot-object interaction. We apply this method to solve significantly more challenging versions of tasks from the PerAct2 benchmark, such as requiring the point-to-point transfer of a ball through an obstacle course. Furthermore, we establish that our method achieves real-time performance on commodity GPUs and facilitates successful sim-to-real transfer by leveraging unique features within MuJoCo. The paper concludes with a statistical analysis of the sample complexity and robustness, quantifying the performance of our approach. The project website is available at: https://sites.google.com/view/bimanualakslabunitartu.
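For readers unfamiliar with MPPI, the sketch below illustrates the core update the abstract refers to: sample perturbations of a nominal control sequence, score each perturbed sequence by rolling it out in a batched simulator, and average the perturbations with softmin weights on the trajectory costs. It is a generic, minimal MPPI step in plain NumPy, not the authors' implementation; the `rollout_costs` callable and the toy double-integrator cost stand in for the paper's GPU-parallel MuJoCo rollouts and task-specific cost functions.

```python
import numpy as np

def mppi_step(u_nominal, rollout_costs, n_samples=256, noise_sigma=0.1,
              temperature=1.0, rng=None):
    """One MPPI update: perturb the nominal control sequence, score each
    perturbed sequence with a (parallel) simulator rollout, and return the
    cost-weighted average sequence.

    u_nominal     : (H, nu) nominal control sequence over horizon H
    rollout_costs : callable mapping an (N, H, nu) batch of control
                    sequences to an (N,) vector of trajectory costs
                    (stands in for batched MuJoCo rollouts on the GPU)
    """
    rng = rng or np.random.default_rng()
    H, nu = u_nominal.shape
    noise = rng.normal(0.0, noise_sigma, size=(n_samples, H, nu))
    candidates = u_nominal[None] + noise              # (N, H, nu)
    costs = rollout_costs(candidates)                 # (N,)
    beta = costs.min()
    weights = np.exp(-(costs - beta) / temperature)   # softmin weights
    weights /= weights.sum()
    return u_nominal + np.einsum("n,nhu->hu", weights, noise)

# toy usage: drive a 2-D single integrator toward the origin
def toy_costs(U, x0=np.array([1.0, -1.0]), dt=0.05):
    costs = []
    for u_seq in U:
        x, c = x0.copy(), 0.0
        for u in u_seq:
            x = x + dt * u
            c += float(x @ x) + 1e-3 * float(u @ u)
        costs.append(c)
    return np.array(costs)

u = np.zeros((20, 2))
for _ in range(30):
    u = mppi_step(u, toy_costs)
```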


TwinVLA: Data-Efficient Bimanual Manipulation with Twin Single-Arm Vision-Language-Action Models

Im, Hokyun, Jeong, Euijin, Fu, Jianlong, Kolobov, Andrey, Lee, Youngwoon

arXiv.org Artificial Intelligence

Vision-language-action models (VLAs) trained on large-scale robotic datasets have demonstrated strong performance on manipulation tasks, including bimanual tasks. However, because most public datasets focus on single-arm demonstrations, adapting VLAs for bimanual tasks typically requires substantial additional bimanual data and fine-tuning. To address this challenge, we introduce TwinVLA, a modular framework that composes two copies of a pretrained single-arm VLA into a coordinated bimanual VLA. Unlike monolithic cross-embodiment models trained on mixtures of single-arm and bimanual data, TwinVLA improves both data efficiency and performance by composing pretrained single-arm policies. Across diverse bimanual tasks in real-world and simulation settings, TwinVLA outperforms a comparably-sized monolithic RDT-1B model without requiring any bimanual pretraining. Furthermore, it narrows the gap to the state-of-the-art model π0, which relies on extensive proprietary bimanual data and compute. These results establish our modular composition approach as a data-efficient and scalable path toward high-performance bimanual manipulation that leverages public single-arm data.
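The composition idea can be pictured with a small wrapper that queries two pretrained single-arm policies and concatenates their actions. This is only a schematic reading of the abstract, not TwinVLA's architecture: the policy interface, observation keys, and 7-DoF action size are invented for illustration, and the coordination machinery that makes the real system work as one bimanual policy is omitted.

```python
from dataclasses import dataclass
from typing import Callable, Dict
import numpy as np

Policy = Callable[[Dict], np.ndarray]  # obs dict -> single-arm action

@dataclass
class TwinWrapper:
    """Rough sketch of the composition idea: two pretrained single-arm
    policies, each fed its own arm's proprioception plus the shared image
    and instruction, produce one concatenated bimanual action."""
    left: Policy
    right: Policy

    def act(self, obs: Dict) -> np.ndarray:
        a_left = self.left({"image": obs["image"],
                            "proprio": obs["proprio_left"],
                            "instruction": obs["instruction"]})
        a_right = self.right({"image": obs["image"],
                              "proprio": obs["proprio_right"],
                              "instruction": obs["instruction"]})
        return np.concatenate([a_left, a_right])

# usage with dummy callables standing in for pretrained single-arm VLAs
dummy = lambda o: np.zeros(7)
policy = TwinWrapper(left=dummy, right=dummy)
action = policy.act({"image": None, "proprio_left": np.zeros(7),
                     "proprio_right": np.zeros(7),
                     "instruction": "lift the tray"})
```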


VLBiMan: Vision-Language Anchored One-Shot Demonstration Enables Generalizable Bimanual Robotic Manipulation

Zhou, Huayi, Jia, Kui

arXiv.org Artificial Intelligence

Achieving generalizable bimanual manipulation requires systems that can learn efficiently from minimal human input while adapting to real-world uncertainties and diverse embodiments. Existing approaches face a dilemma: imitation policy learning demands extensive demonstrations to cover task variations, while modular methods often lack flexibility in dynamic scenes. We introduce VLBiMan, a framework that derives reusable skills from a single human example through task-aware decomposition, preserving invariant primitives as anchors while dynamically adapting adjustable components via vision-language grounding. This adaptation mechanism resolves scene ambiguities caused by background changes, object repositioning, or visual clutter without policy retraining, leveraging semantic parsing and geometric feasibility constraints. Moreover, the system inherits human-like hybrid control capabilities, enabling mixed synchronous and asynchronous use of both arms. Extensive experiments validate VLBiMan across tool-use and multi-object tasks, demonstrating: (1) a drastic reduction in demonstration requirements compared to imitation baselines, (2) compositional generalization through atomic skill splicing for long-horizon tasks, (3) robustness to novel but semantically similar objects and external disturbances, and (4) strong cross-embodiment transfer, showing that skills learned from human demonstrations can be instantiated on different robotic platforms without retraining. By bridging human priors with vision-language anchored adaptation, our work takes a step toward practical and versatile dual-arm manipulation in unstructured settings.
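One way to picture the anchor-plus-adaptation split is a skill record whose invariant relative motion is kept from the single demonstration while its target pose is re-grounded at run time by a vision-language detector. The sketch below is purely illustrative and uses hypothetical names (`Skill`, `instantiate`, the `detect` callable); it ignores orientations, feasibility constraints, and the semantic parsing the paper describes.

```python
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class Skill:
    """One skill extracted from a single demonstration, split into an
    invariant primitive (a relative end-effector motion kept as-is) and an
    adjustable component (the grasp target) re-grounded at execution time."""
    name: str
    arm: str                        # "left" or "right"
    relative_motion: np.ndarray     # (T, 6) invariant primitive
    target_query: str               # e.g. "handle of the mug"

def instantiate(skill: Skill, detect: Callable[[str], np.ndarray]) -> np.ndarray:
    """Re-anchor the invariant motion at the pose returned by the detector."""
    anchor = detect(skill.target_query)   # (6,) pose in the robot frame
    return anchor[None, :] + skill.relative_motion

# hypothetical detector standing in for the vision-language grounding module
fake_detect = lambda q: np.array([0.4, 0.1, 0.2, 0.0, 0.0, 0.0])
traj = instantiate(Skill("grasp", "right", np.zeros((10, 6)),
                         "handle of the mug"), fake_detect)
```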


Learning Bimanual Manipulation via Action Chunking and Inter-Arm Coordination with Transformers

Motoda, Tomohiro, Hanai, Ryo, Nakajo, Ryoichi, Murooka, Masaki, Erich, Floris, Domae, Yukiyasu

arXiv.org Artificial Intelligence

Robots that operate autonomously in human living environments need the ability to handle a variety of tasks flexibly. One crucial element is coordinated bimanual movement, which enables functions that are difficult to perform with one hand alone. However, the robot's many degrees of freedom make control difficult to reason about, and the left and right arms must adjust their actions to the situation, which makes more dexterous tasks hard to realize. To address this issue, we focus on coordination and efficiency between the two arms, particularly for synchronized actions, and propose a novel imitation learning architecture that predicts cooperative actions. We give each arm its own architecture and add an intermediate encoder layer, the Inter-Arm Coordinated transformer Encoder (IACE), which facilitates synchronization and temporal alignment to ensure smooth, coordinated actions. To verify the effectiveness of our architecture, we evaluate it on distinctive bimanual tasks. The experimental results show that our model achieves a high success rate relative to baselines and suggest a suitable architecture for policy learning in bimanual manipulation.
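The coordination idea can be sketched as a shared attention layer sitting between per-arm encoders and per-arm action-chunk heads, so each arm's features attend to the other's before actions are decoded. The PyTorch module below is a rough illustration under that reading, not the paper's IACE: the token layout, dimensions, and chunk size are invented, and the vision backbone and training objective are omitted.

```python
import torch
import torch.nn as nn

class InterArmEncoder(nn.Module):
    """Sketch of an intermediate encoder shared by both arms: per-arm
    observations are embedded separately, a shared transformer layer attends
    across the concatenated left/right tokens, and per-arm heads decode
    chunks of future actions."""
    def __init__(self, d_model=256, action_dim=7, chunk=10):
        super().__init__()
        self.left_enc = nn.Linear(14, d_model)   # per-arm proprio embedding
        self.right_enc = nn.Linear(14, d_model)
        self.cross = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.left_head = nn.Linear(d_model, action_dim * chunk)
        self.right_head = nn.Linear(d_model, action_dim * chunk)
        self.chunk, self.action_dim = chunk, action_dim

    def forward(self, obs_left, obs_right):
        tokens = torch.stack([self.left_enc(obs_left),
                              self.right_enc(obs_right)], dim=1)
        fused = self.cross(tokens)               # (B, 2, d_model), arms attend to each other
        a_left = self.left_head(fused[:, 0]).view(-1, self.chunk, self.action_dim)
        a_right = self.right_head(fused[:, 1]).view(-1, self.chunk, self.action_dim)
        return a_left, a_right

model = InterArmEncoder()
a_left, a_right = model(torch.zeros(2, 14), torch.zeros(2, 14))
```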


You Only Teach Once: Learn One-Shot Bimanual Robotic Manipulation from Video Demonstrations

Zhou, Huayi, Wang, Ruixiang, Tai, Yunxin, Deng, Yueci, Liu, Guiliang, Jia, Kui

arXiv.org Artificial Intelligence

Bimanual robotic manipulation is a long-standing challenge of embodied intelligence due to its characteristics of dual-arm spatial-temporal coordination and high-dimensional action spaces. Previous studies rely on pre-defined action taxonomies or direct teleoperation to alleviate or circumvent these issues, often making them lack simplicity, versatility and scalability. Differently, we believe that the most effective and efficient way to teach bimanual manipulation is learning from human-demonstrated videos, where rich features such as spatial-temporal positions, dynamic postures, interaction states and dexterous transitions are available almost for free. In this work, we propose YOTO (You Only Teach Once), which can extract and then inject patterns of bimanual actions from as little as a single binocular observation of hand movements, and teach dual robot arms various complex tasks. Furthermore, based on keyframe-based motion trajectories, we devise a subtle solution for rapidly generating training demonstrations with diverse variations of manipulated objects and their locations. These data can then be used to learn a customized bimanual diffusion policy (BiDP) across diverse scenes. In experiments, YOTO achieves impressive performance in mimicking 5 intricate long-horizon bimanual tasks, possesses strong generalization under different visual and spatial conditions, and outperforms existing visuomotor imitation learning methods in accuracy and efficiency. Our project link is https://hnuzhy.github.io/projects/YOTO.
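The demonstration-replication trick can be illustrated very simply: express the demonstrated keyframes relative to the object pose observed in the video, then re-anchor them at perturbed object poses to synthesize additional demonstrations. The snippet below is a toy, position-only version of that idea with made-up numbers; the actual pipeline additionally handles orientations, hand-to-gripper retargeting, and object variation.

```python
import numpy as np

def augment_demo(keyframes, obj_pose_demo, obj_pose_new):
    """Toy replication of a single keyframe demonstration (positions only):
    shift each keyframe by the displacement between the demonstrated object
    pose and a new object pose to synthesize another training demonstration."""
    offset = obj_pose_new - obj_pose_demo
    return [kf + offset for kf in keyframes]

demo_keyframes = [np.array([0.3, 0.0, 0.2]), np.array([0.3, 0.0, 0.05])]
rng = np.random.default_rng(0)
new_demos = [augment_demo(demo_keyframes,
                          obj_pose_demo=np.array([0.3, 0.0, 0.0]),
                          obj_pose_new=np.array([0.35, 0.1, 0.0]) + d)
             for d in rng.normal(0.0, 0.02, size=(5, 3))]
```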


Large Language Models for Orchestrating Bimanual Robots

Chu, Kun, Zhao, Xufeng, Weber, Cornelius, Li, Mengdi, Lu, Wenhao, Wermter, Stefan

arXiv.org Artificial Intelligence

Although there has been rapid progress in endowing robots with the ability to solve complex manipulation tasks, generating control policies for bimanual robots to solve tasks involving two hands is still challenging because of the difficulties in effective temporal and spatial coordination. With emergent abilities in terms of step-by-step reasoning and in-context learning, Large Language Models (LLMs) have been applied to control a variety of robotic tasks. However, the nature of language communication via a single sequence of discrete symbols makes LLM-based coordination in continuous space a particular challenge for bimanual tasks. To tackle this challenge with an LLM for the first time, we present LAnguage-model-based Bimanual ORchestration (LABOR), an agent that utilizes an LLM to analyze task configurations and devise coordination control policies for addressing long-horizon bimanual tasks. In a simulated environment, the LABOR agent is evaluated on several everyday tasks with the NICOL humanoid robot. The reported success rates indicate that overall coordination efficiency is close to optimal performance, while an analysis of failure causes, classified into spatial coordination, temporal coordination and skill selection, shows that these vary across tasks. The project website can be found at http://labor-agent.github.io
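At a high level, such an orchestration agent serializes the task and scene into a prompt, asks the language model for a per-arm plan, and dispatches the named skills. The loop below is a minimal sketch of that pattern, not the LABOR agent: the JSON plan format, the `llm` callable, and the skill registry are assumptions, and the canned response merely stands in for a real model call.

```python
import json
from typing import Callable, Dict, List

def orchestrate(task_description: str, scene: Dict, llm: Callable[[str], str],
                skills: Dict[str, Callable]) -> List[Dict]:
    """Illustrative orchestration loop: serialize the scene and task into a
    prompt, parse the model's JSON plan assigning one skill per arm per step,
    and execute the plan by looking up the named skills."""
    prompt = (
        "Plan coordinated two-arm actions as a JSON list of steps, each "
        '{"left": {"skill": ..., "args": ...}, "right": {...}}.\n'
        f"Task: {task_description}\nScene: {json.dumps(scene)}"
    )
    plan = json.loads(llm(prompt))
    for step in plan:
        for arm in ("left", "right"):
            spec = step[arm]
            skills[spec["skill"]](arm, **spec.get("args", {}))
    return plan

# usage with a canned response standing in for a real LLM call
canned = lambda p: ('[{"left": {"skill": "hold", "args": {}}, '
                    '"right": {"skill": "pour", "args": {}}}]')
log = orchestrate("pour water from the bottle into the cup",
                  {"bottle": [0.4, -0.2], "cup": [0.4, 0.2]},
                  canned, {"hold": lambda arm: None, "pour": lambda arm: None})
```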


Stabilize to Act: Learning to Coordinate for Bimanual Manipulation

Grannen, Jennifer, Wu, Yilin, Vu, Brandon, Sadigh, Dorsa

arXiv.org Artificial Intelligence

Bimanual coordination is pervasive, spanning household activities such as cutting food, surgical skills such as suturing a wound, or industrial tasks such as connecting two cables. In robotics, the addition of a second arm opens the door to a higher level of task complexity, but comes with a number of control challenges. With a second arm, we have to reason about how to produce coordinated behavior in a higher dimensional action space, resulting in more computationally challenging learning, planning, and optimization problems. The addition of a second arm also complicates data collection--it requires teleoperating a robot with more degrees of freedom--which hinders our ability to rely on methods that require expert bimanual demonstrations. To combat these challenges, we can draw inspiration from how humans tackle bimanual tasks--specifically alternating between using one arm to stabilize parts of the environment, then using the other arm to act conditioned on the stabilized state of the world. Alternating stabilizing and acting offers a significant gain over both model-based and data-driven prior approaches for bimanual manipulation. Previous model-based techniques have proposed planning algorithms for bimanual tasks such as collaborative transport or scooping [1, 2, 3], but require hand-designed specialized primitives or follow predefined trajectories limiting their abilities to learn new skills or adapt. On another extreme, we turn to reinforcement learning (RL) techniques that do not need costly primitives. However, RL methods are notoriously data hungry and a high-dimensional bimanual action space further exacerbates this problem.
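The alternating scheme lends itself to a simple control-loop sketch: pick a stabilizing arm, command it to hold a pose, and let the other arm's policy act in the reduced single-arm action space, then switch roles at the next phase. The code below is only that skeleton, with a dummy environment and placeholder callables; learning where to stabilize and how to act from data, which is the paper's contribution, is not reproduced here.

```python
import numpy as np

class DummyEnv:
    """Minimal stand-in environment taking one action per arm."""
    def reset(self):
        return {"t": 0}
    def step(self, action):
        return {"t": 0}, False

def stabilize_then_act(env, choose_stabilizer, hold_pose, acting_policy,
                       n_phases=4, steps_per_phase=50):
    """Alternate phases: one arm holds a fixed stabilizing pose while the
    other arm's policy acts; roles can switch between phases."""
    obs = env.reset()
    for _ in range(n_phases):
        stab_arm = choose_stabilizer(obs)              # "left" or "right"
        act_arm = "right" if stab_arm == "left" else "left"
        pose = hold_pose(obs, stab_arm)
        for _ in range(steps_per_phase):
            action = {stab_arm: pose,                  # keep stabilizer fixed
                      act_arm: acting_policy(obs, act_arm)}
            obs, done = env.step(action)
            if done:
                return obs
    return obs

obs = stabilize_then_act(DummyEnv(),
                         choose_stabilizer=lambda o: "left",
                         hold_pose=lambda o, arm: np.zeros(7),
                         acting_policy=lambda o, arm: np.zeros(7))
```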


New dual-arm robot achieves bimanual tasks by learning from simulation

Robohub

The new Bi-Touch system, designed by scientists at the University of Bristol and based at the Bristol Robotics Laboratory, allows robots to carry out manual tasks by sensing what to do from a digital helper. The findings, published in IEEE Robotics and Automation Letters, show how an AI agent interprets its environment through tactile and proprioceptive feedback and then controls the robots' behaviours, enabling precise sensing, gentle interaction, and effective object manipulation to accomplish robotic tasks. This development could revolutionise industries such as fruit picking and domestic service, and eventually recreate touch in artificial limbs. Lead author Yijiong Lin from the Faculty of Engineering explained: "With our Bi-Touch system, we can easily train AI agents in a virtual world within a couple of hours to achieve bimanual tasks that are tailored towards the touch. And more importantly, we can directly apply these agents from the virtual world to the real world without ...


Interactive Imitation Learning of Bimanual Movement Primitives

Franzese, Giovanni, Rosa, Leandro de Souza, Verburg, Tim, Peternel, Luka, Kober, Jens

arXiv.org Artificial Intelligence

Performing bimanual tasks with dual robotic setups can drastically increase the impact on industrial and daily life applications. However, performing a bimanual task brings many challenges, like synchronization and coordination of the single-arm policies. This article proposes the Safe, Interactive Movement Primitives Learning (SIMPLe) algorithm, to teach and correct single or dual arm impedance policies directly from human kinesthetic demonstrations. Moreover, it proposes a novel graph encoding of the policy based on Gaussian Process Regression (GPR) where the single-arm motion is guaranteed to converge close to the trajectory and then towards the demonstrated goal. Modern society is faced with a lack of workforce in various repetitive jobs like re-shelving products in supermarkets or handling heavy luggage in airports. Robots appear to be the most promising solution to mitigate the negative effects of the declining workforce and perform these various complex jobs. To work in variable and unstructured environments, robots must be dexterous and intelligent to quickly learn the job while interacting safely with other robots, objects, and humans. Factory assembly, logistics, and household applications of bimanual robots have been known for decades [7], [8]. However, the increased number of Degrees of Freedom (DoFs) (the curse of dimensionality) implies an increased teaching complexity and the necessity of skilled human teachers who know how to interface with the bimanual robotic platform. In this paper we contribute the Safe Interactive Movement Primitive Learning (SIMPLe) algorithm.
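A minimal piece of the machinery this abstract mentions is Gaussian Process Regression over a demonstrated trajectory, whose predictive mean gives the attractor to track and whose variance can modulate stiffness. The NumPy sketch below implements only that GPR core for a single time-indexed degree of freedom under standard RBF-kernel assumptions; the graph encoding, dual-arm coupling, and impedance controller of SIMPLe are not shown.

```python
import numpy as np

def rbf(a, b, length=0.1):
    """Squared-exponential kernel between two 1-D input arrays."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gpr_fit_predict(t_demo, x_demo, t_query, noise=1e-4):
    """Plain GPR over one demonstrated DoF: return the predictive mean
    (the attractor to track) and variance (usable to scale stiffness)
    at the query times."""
    K = rbf(t_demo, t_demo) + noise * np.eye(len(t_demo))
    Ks = rbf(t_query, t_demo)
    alpha = np.linalg.solve(K, x_demo)
    mean = Ks @ alpha
    var = 1.0 - np.einsum("ij,ji->i", Ks, np.linalg.solve(K, Ks.T))
    return mean, var

# one-DoF kinesthetic demo, queried on a finer time grid
t = np.linspace(0, 1, 20)
x = 0.1 * np.sin(2 * np.pi * t)
mean, var = gpr_fit_predict(t, x, np.linspace(0, 1, 100))
```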